fix(train): support eval-only mode (--num-rollout 0) #2109

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:fix/eval-only-num-rollout-zero
EazyReal commented Jun 20, 2026

Contributor

What changed

get_optimizer_param_scheduler now computes the estimated training-iteration count once and clamps the scheduler-visible train_iters to at least 1 before deriving Megatron LR/WD schedule steps.
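The clamp described above can be sketched roughly as follows. This is an illustration only, not the PR's diff: the helper names `estimate_train_iters` and `scheduler_train_iters`, the argument fields other than `num_rollout`, and the estimate formula itself are assumptions made for the example.

```python
from types import SimpleNamespace


def estimate_train_iters(args):
    """Hypothetical estimate of optimizer steps for a run.

    Assumed formula: total samples produced by all rollouts divided by the
    global batch size. The real estimate is whatever the PR's code computes.
    """
    total_samples = args.num_rollout * args.rollout_batch_size * args.n_samples_per_prompt
    return total_samples // args.global_batch_size


def scheduler_train_iters(args):
    """Compute the estimate once, then clamp it to at least 1 for the scheduler."""
    estimated = estimate_train_iters(args)
    return max(1, estimated)


# Illustrative configs; the field values are made up for this sketch.
eval_only = SimpleNamespace(
    num_rollout=0, rollout_batch_size=4, n_samples_per_prompt=8, global_batch_size=16
)
normal = SimpleNamespace(
    num_rollout=8, rollout_batch_size=4, n_samples_per_prompt=8, global_batch_size=16
)

print(scheduler_train_iters(eval_only))  # estimate is 0, clamped to 1
print(scheduler_train_iters(normal))     # estimate is 16, left unchanged
```

Because the clamp is applied to the single computed estimate, any schedule step counts derived from it inherit a nonzero value without a second special case.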

A CPU regression test (tests/test_eval_only_optimizer_scheduler.py) stubs Megatron's OptimizerParamScheduler, preserves its lr_decay_steps > 0 assertion, and is registered in the cpu-unittest matrix.

Why

train.py has an eval-only path for --num-rollout 0 with --eval-interval, but model and optimizer setup run before that branch. With num_rollout == 0, the old estimate produced train_iters == 0, then lr_decay_steps == 0, so Megatron aborted before eval could start.

The clamp only gives the scheduler a valid nonzero size for zero-estimated runs. It does not add training iterations; the training loop is still controlled by args.num_rollout. For normal configs that already estimate at least one optimizer step, the value is unchanged.
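A minimal sketch of that separation, assuming a simplified control flow: setup runs first, the eval-only branch returns early, and the loop is driven by `num_rollout` alone. The function `run`, the `estimated_train_iters` field, and the event strings are hypothetical and do not mirror train.py line for line.

```python
from types import SimpleNamespace


def run(args):
    """Hypothetical, simplified ordering of setup, eval-only branch, and training loop."""
    events = []

    # Setup happens before any eval-only check, so the scheduler must be constructible.
    scheduler_iters = max(1, args.estimated_train_iters)
    events.append(f"scheduler_iters={scheduler_iters}")

    # Eval-only path: evaluate once and stop without training.
    if args.num_rollout == 0 and args.eval_interval is not None:
        events.append("eval")
        return events

    # The loop length depends only on num_rollout, never on the clamped value.
    for rollout_id in range(args.num_rollout):
        events.append(f"train rollout {rollout_id}")
    return events


eval_only_events = run(
    SimpleNamespace(estimated_train_iters=0, num_rollout=0, eval_interval=1)
)
print(eval_only_events)
```

In this sketch the scheduler is sized for one step, yet no `train rollout` event is ever recorded when `num_rollout` is 0.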

Validation

tests/test_eval_only_optimizer_scheduler.py covers both the eval-only startup case (num_rollout=0 no longer trips Megatron's scheduler assertion and sets train_iters == 1) and a normal training config where train_iters remains 16.
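For readers who want the shape of such a test without Megatron installed, here is a self-contained sketch. It is not the test file from this PR: `StubOptimizerParamScheduler`, `build_scheduler`, `make_args`, the `samples_per_rollout` field, and the `clamp` switch are all hypothetical stand-ins, and the stub keeps only the `lr_decay_steps > 0` assertion.

```python
from types import SimpleNamespace


class StubOptimizerParamScheduler:
    """Stand-in for Megatron's scheduler that keeps only its positivity check."""

    def __init__(self, lr_decay_steps):
        assert lr_decay_steps > 0
        self.lr_decay_steps = lr_decay_steps


def build_scheduler(args, clamp=True):
    # Hypothetical estimate of optimizer steps; the real formula is in the PR.
    estimated = args.num_rollout * args.samples_per_rollout // args.global_batch_size
    # clamp=False reproduces the pre-fix behavior for comparison.
    args.train_iters = max(1, estimated) if clamp else estimated
    return StubOptimizerParamScheduler(args.train_iters * args.global_batch_size)


def make_args(num_rollout):
    # Values chosen only so that the normal case estimates 16 iterations.
    return SimpleNamespace(
        num_rollout=num_rollout, samples_per_rollout=32, global_batch_size=16
    )


def test_eval_only_startup():
    args = make_args(0)
    build_scheduler(args)  # must not trip the stub's assertion
    assert args.train_iters == 1


def test_normal_config_unchanged():
    args = make_args(8)
    build_scheduler(args)
    assert args.train_iters == 16


def test_unclamped_zero_trips_assertion():
    try:
        build_scheduler(make_args(0), clamp=False)
    except AssertionError:
        return
    raise RuntimeError("expected the stub to reject lr_decay_steps == 0")
```

Keeping the assertion in the stub matters: a stub that accepted any value would let the eval-only case pass even without the clamp.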

Fixes #1785

EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from a658aac to 9e5f530 on June 20, 2026 18:21
EazyReal changed the title from "fix: support eval-only mode (--num-rollout 0)" to "fix(train): support eval-only mode (--num-rollout 0)" on Jun 24, 2026
EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from 9e5f530 to 1f59044 on June 24, 2026 03:18
EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from 1f59044 to 6f3d1d3 on June 24, 2026 04:19
@EazyReal

Contributor Author

@zhuzilin could you review this one? Eval-only mode with --num-rollout 0 still constructs the Megatron optimizer scheduler, which rejects zero lr_decay_steps; this keeps the training loop at zero rollouts while giving the scheduler the smallest valid shape.


Development

Successfully merging this pull request may close these issues.

[Bug] train.py num_rollout==0 error
